{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Local Homology NLP Use Cases: unsupervised word disambiguation\n",
    "\n",
    "In this tutorial we apply **local homology** to study natural language processing (NLP) data. We showcase how to use local homology to disambiguate words. In particular, we focus our analysis to disambiguate the word \"note\": either a musical frequency, a short text, the verb, or money.\n",
    "\n",
    "If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/local_hom_NLP_disambiguation.ipynb) and download the source.\n",
    "\n",
    "## See also\n",
    "  - [Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy](https://giotto-ai.github.io/gtda-docs/latest/notebooks/vietoris_rips_quickstart.html) for a quick introduction to general topological feature extraction in ``giotto-tda``.\n",
    "  - [Local Homology](https://giotto-ai.github.io/gtda-docs/latest/notebooks/local_homology.html), in which we introduce the ``RadiusLocalVietorisRips`` and ``KNeighborsLocalVietorisRips`` transformers used here are introduced.\n",
    "\n",
    "**License: AGPLv3**\n",
    "\n",
    "## The context\n",
    "\n",
    "Most modern machine learning techniques dealing with NLP data need to preprocess text data and trasform them into more standard objects: arrays. The process of transforming a word (or token, i.e. a word stripped out of its ending) into an array is called *word embedding*. \n",
    "\n",
    "There are of course more general techniques than simply transforming the single tokens: sentence embedding is one of the generalisations. However, for the purpose of this notebook, we will stick to word embeddings and in particular we will use a technique called [word2vec](https://en.wikipedia.org/wiki/Word2vec).\n",
    "\n",
    "The disadvantage of any word embedding techniques is that the same token is mapped to the same array. Hence, if two words have the same written form but different meanings (i.e., *homographic words*), such words will anyway be mapped to the same array!\n",
    "\n",
    "## The task \n",
    "\n",
    "Given the above introduction, our task is to *disambiguate words*, i.e. finding a way to differentiate homographic words by their meaning. Our approach consists of analysing the whole sentence in which the word appears and try to deduce the meaning of the word from its context (i.e. neighbouring words). The core idea of our proposal is based on the algebro-topological description of the space of word embeddings.\n",
    "\n",
    "## The main idea\n",
    "\n",
    "We are structuring our analysis on the assumptions that are clearly explained in [this paper](https://arxiv.org/pdf/2011.09413.pdf). In few words, the idea is that a word with multiple meaning sits on the sigular loci of the stratified space of the word embeddings. This sentence is in truth not formally correct, as there is yet no clear notion of what is the canonical topology in the word embedding space; nonetheless, the intuition behind these concepts can be explained pictorially:\n",
    "\n",
    "![sing](images/local_singularity.png)\n",
    "\n",
    "This picture represents the local shape of the word-embedding space: the context words are located over the four cones tipping at the word `mole` (the singularity). Hence, a sentence from a thriller containing the word `mole` would most probably be located on the north-west branch, and so on.\n",
    "\n",
    "## The goal of our exploration\n",
    "\n",
    "We would first like to stress that this notebook is merely exploratory and that there is no aim at making it a fully fledge ML pipeline for word disambiguation. \n",
    "We would really like to understand if the shape of the cones around the singularity (i.e. the `mole` point) can be distinguished with **local homology**: if that is the case, then the geometric shape of a sentence lying on the word embedding stratified space correlates with the meaning of a word! This entails a new disambiguation technique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import PyData libraries\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "from sklearn.preprocessing import StandardScaler, FunctionTransformer\n",
    "from sklearn.pipeline import make_pipeline, Pipeline\n",
    "\n",
    "# Import umap for dimensionality reduction and visualization\n",
    "# umap-learn is not a requirement of giotto-tda, but is needed here.\n",
    "from umap import UMAP\n",
    "\n",
    "# giotto-tda imports\n",
    "from gtda.plotting import plot_point_cloud\n",
    "from gtda.local_homology import KNeighborsLocalVietorisRips\n",
    "from gtda.diagrams import PersistenceEntropy\n",
    "\n",
    "# Import needed NLP libraries\n",
    "# gensim is not a requirement of giotto-tda, but is needed here.\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.test.utils import common_texts\n",
    "from gensim.parsing.preprocessing import remove_stopwords, stem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The dataset and its preprocessing\n",
    "\n",
    "The dataset contains many occurrences of the word \"note\" with the meanings stated above. The format is `.xml`, out of which we extract the plain text and do the standard preprocessing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preprocess the data\n",
    "with open(\"data/note.n.xml\", \"r\") as f:\n",
    "    content = f.read()\n",
    "\n",
    "# Split the sentences\n",
    "temp_list = content.split(\"<note.n.\")\n",
    "\n",
    "# Remove stopwords\n",
    "list_of_text = list(map(remove_stopwords, temp_list))\n",
    "\n",
    "# Make lower case and stem\n",
    "list_of_text = list(map(stem, temp_list))\n",
    "\n",
    "refined_list = [list_of_text[i][1 + len(str(i)):-12 - len(str(i))]\n",
    "                for i in range(2, len(list_of_text))]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract senteces from our corpus and vectorize their words using ``Word2Vec``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PreprocessText(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"A basic class to transform a list of sentences (strings)\n",
    "    into a list of arrays, each with 2 dimensions: (n_words, dim_emb_space)\n",
    "\n",
    "    Note that the output is a list of arrays, each of two dimensions: the\n",
    "    first dimension is the number of words in a sentence, while the \n",
    "    second dimension the word embedding dimension.\n",
    "\n",
    "    The ``item``parameter is useful to select, for the transform, which item\n",
    "    to select.\n",
    "    Set ``item = None`` to get the whole list\n",
    "    \"\"\"\n",
    "    def __init__(self, vector_size=30, window=5, min_count=1, workers=4):\n",
    "        self.vector_size = vector_size\n",
    "        self.window = window\n",
    "        self.min_count = min_count\n",
    "        self.workers = workers\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        all_words_in_sentences = list(map(str.split, X))\n",
    "        self.word2vec = Word2Vec(sentences=all_words_in_sentences,\n",
    "                                 vector_size=self.vector_size,\n",
    "                                 window=self.window,\n",
    "                                 min_count=self.min_count,\n",
    "                                 workers=self.workers,\n",
    "                                 seed=11)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, y=None):\n",
    "        all_words_in_sentences = list(map(str.split, X))\n",
    "        list_of_vect_sentences = [self.word2vec.wv[all_words_in_sentences[i]]\n",
    "                                  for i in range(len(all_words_in_sentences))]\n",
    "        return list_of_vect_sentences\n",
    "\n",
    "pt = PreprocessText()\n",
    "list_of_vect_sentences = pt.fit_transform(refined_list) # all vectorized sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The exploratory results\n",
    "\n",
    "Here below we see a couple of sentences containing the word \"note\" with the different meanings described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"\"\"Example of a preprocessed sentence where 'note' is used as a verb:\n",
    "refined_list[0] = {refined_list[0]}\n",
    "\n",
    "Example of preprocessed sentence where 'note' refers to music:\n",
    "refined_list[1] = {refined_list[1]}\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Local homology pipeline\n",
    "\n",
    "As explained in [Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy](https://giotto-ai.github.io/gtda-docs/latest/notebooks/vietoris_rips_quickstart.html), \"persistence diagrams\" are a common and useful way to store information about the topology and geometry of data. Their content can be summarized and made interpretable even further by means of a variety of featurization methods.\n",
    "\n",
    "We will use a similar topological feature extraction method as in the [Local Homology](https://giotto-ai.github.io/gtda-docs/latest/notebooks/local_homology.html) notebook, and apply it to our Word2Vec embeddings!  In particular, we will fit our local homology transformer on either of the example sentences above (seen as point clouds of word vectors) and transform it on the fixed Word2Vec embedding for the word \"note\", to see if any topological differences arise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize a `KNeighborsLocalVietorisRips` local homology transformer.\n",
    "n_neighbors = (5, 15)\n",
    "homology_dimensions = (0, 1)\n",
    "collapse_edges = True\n",
    "n_jobs = -1\n",
    "kn_lh = KNeighborsLocalVietorisRips(n_neighbors=n_neighbors,\n",
    "                                    homology_dimensions=homology_dimensions,\n",
    "                                    collapse_edges=collapse_edges,\n",
    "                                    n_jobs=n_jobs)\n",
    "\n",
    "# Define a featurization method for persistence diagrams.\n",
    "mod_pe =  make_pipeline(PersistenceEntropy(),\n",
    "                        FunctionTransformer(func=lambda X: 2 ** X))\n",
    "\n",
    "#  Word2Vec embedding of \"note\"\n",
    "note_w2v = pt.word2vec.wv[\"note\"].reshape(1, -1)\n",
    "\n",
    "# Fit the local homology transformer on the example sentence in which 'note' is a verb, and apply it to the \"note\" word vector\n",
    "kn_lh.fit(list_of_vect_sentences[0])\n",
    "mod_pe.fit_transform(kn_lh.transform(note_w2v))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now fit the local homology transformer on the example sentence in which 'note' refers to music, and apply it to the \"note\" word vector again\n",
    "kn_lh.fit(list_of_vect_sentences[1])\n",
    "mod_pe.fit_transform(kn_lh.transform(note_w2v))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now visualize the word2vec embeddings of two different sentences, using UMAP to perform dimensionality reduction:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plotting_the_embedding(i, string):\n",
    "    \"\"\"this function displays the word embedding space reduced\n",
    "    to two dimensions by the UMAP algorithm. In yellow the word\n",
    "    `note` is highlighted.\"\"\"\n",
    "    value = None\n",
    "    # for loop to find the instance of note\n",
    "    for k in range(len(list_of_vect_sentences[i])):\n",
    "        if (list_of_vect_sentences[i][k] == pt.transform((\"note\",))[0]).all():\n",
    "            value = k\n",
    "    temp = np.zeros((len(list_of_vect_sentences[i])))\n",
    "    temp[value] = 1\n",
    "\n",
    "    reducer = UMAP()\n",
    "\n",
    "    scaled_point_cloud = StandardScaler().fit_transform(list_of_vect_sentences[i])\n",
    "\n",
    "    embedding = reducer.fit_transform(scaled_point_cloud)\n",
    "\n",
    "    plt.scatter(\n",
    "        embedding[:, 0],\n",
    "        embedding[:, 1], c = temp)\n",
    "    plt.gca().set_aspect('equal', 'datalim')\n",
    "    plt.title(f'Use of \"note\" as a {string}', fontsize=24)\n",
    "\n",
    "    # Example of sentence with the word \"note\"\n",
    "    print(\"Preprocessed sentence: \")\n",
    "    print(refined_list[i])\n",
    "    kn_lh.fit(list_of_vect_sentences[i])\n",
    "\n",
    "    print(\"First and second Betti numbers:\")\n",
    "    print(mod_pe.fit_transform(kn_lh.transform(np.array(pt.transform((\"note\",))[0], dtype=float))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note as a verb\n",
    "plotting_the_embedding(0, \"verb\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Musical note\n",
    "plotting_the_embedding(1, \"musical note\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The technique seems promising: the results above show a clear distinction between the local shape of the neighborhoods for the word \"note\" with different meanings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A small scale statistical exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now look at 30 sentences where the word \"note\" takes different meanings which we have hand labelled.\n",
    "For each 30 sentences, we look at the 0th and 1st dimension local homology around note and plot the obtained 2 dimensional pointcloud coloured by the meaning of \"note\" in the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "note_emb = pt.word2vec.wv[\"note\"]\n",
    "note_loc_hom = []\n",
    "for i in range(30):\n",
    "    kn_lh = KNeighborsLocalVietorisRips(n_neighbors=(5, 15),\n",
    "                                     homology_dimensions=(0, 1),\n",
    "                                     collapse_edges=True,\n",
    "                                     n_jobs=-1)\n",
    "    sentence_emb = list_of_vect_sentences[i]\n",
    "    kn_lh.fit(sentence_emb)\n",
    "    note_loc_hom.append(mod_pe.fit_transform(kn_lh.transform(note_emb.reshape(1, -1)))[0])\n",
    "\n",
    "note_loc_hom = np.array(note_loc_hom)\n",
    "\n",
    "plt.scatter(\n",
    "    note_loc_hom[:, 0],\n",
    "    note_loc_hom[:, 1],\n",
    "    c = [0, 1, 2, 2, 1, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 3, 3, 3, 2, 1, 1, 3, 2, 2, 1])\n",
    "plt.gca().set_aspect('equal', 'datalim')\n",
    "plt.title('Local dimension around \"note\"', fontsize=24)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "The method seems promising! In particular, the meaning \"money\" (in yellow) seems to have very varying local dimensions, whereas the other classes seem to be more clustered together. However a lot more work is needed: especially, systematising the anylsis, finding the proper vectorisation of local homology, etc...\n",
    "\n",
    "We really hope that this notebook will tinkle your attention and suggest you new relevant research directions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
